Abstract

Copy number variation (CNV) is a major genetic polymorphism contributing to genetic diversity and human evolution.
Clinical application of CNVs for diagnostic purposes largely depends on sufficient population CNV data for accurate
interpretation. CNVs from the general population in currently available databases help classify CNVs of uncertain clinical
significance and benign CNVs. Earlier studies of CNV distribution in several populations worldwide showed that a significant
fraction of CNVs are population specific. In this study, we characterized and analyzed CNVs in 3,017 unrelated Thai
individuals genotyped with the Illumina Human610, Illumina HumanOmniExpress, or Illumina HapMap550v3 platform. We
employed hidden Markov model and circular binary segmentation methods to identify CNVs, extracted 23,458 CNVs
consistently identified by both algorithms, and cataloged these high-confidence CNVs in our publicly available Thai CNV
database. Analysis of CNVs in the Thai population identified a median of eight autosomal CNVs per individual. Most CNVs
(96.73%) did not overlap with any known chromosomal imbalance syndromes documented in the DECIPHER database.
When compared with CNVs in the 11 HapMap3 populations, CNVs found in the Thai population shared several
characteristics with CNVs characterized in HapMap3. Common CNVs in Thais had similar frequencies to those in the
HapMap3 populations, and all high frequency CNVs (>20%) found in Thai individuals could also be identified in HapMap3.
The majority of CNVs discovered in the Thai population, however, were of low frequency or uniquely identified in Thais.
When performing hierarchical clustering using CNV frequencies, the CNV data were clustered into Africans, Europeans, and
Asians, in line with the clustering performed with single nucleotide polymorphism (SNP) data. As CNV data are specific to
the population of origin, our population-specific reference database will serve as a valuable addition to the existing
resources for the investigation of the clinical significance of CNVs in Thais and related ethnicities.

Introduction

Copy Number Variation (CNV) is one of the major genetic
variations observed among genomes of individuals. CNVs
constitute more total nucleotides than Single Nucleotide
Polymorphisms (SNPs), accounting for almost 12% of the human genome,
and are important in terms of genetic diversity as well as human
evolution [1]. At present, several conditions with genetic etiologies,
such as autism spectrum disorder, developmental delay, and
non-syndromic multiple congenital anomalies, are well documented to
have CNVs among the causative variants [2]. For this reason,
array-based technology, which is commonly used for CNV
identification, has been recommended as a first-tier diagnostic
tool for these particular disorders [3]. To make an accurate clinical
interpretation of CNVs, databases containing reference
CNVs from both genetic disease patients and normal controls are
required. Large databases consisting of CNVs and clinical
information of patients with chromosomal disorders, such as
DECIPHER [4] and the International Collaboration for Clinical
Genomics (ICCG; http://www.iccg.org/), are actively curated by
working groups. However, most patients in these databases are of
European descent due to the availability and easy accessibility of
clinical CNV testing in North America and Europe. Apart from these,
there are currently a few other large public CNV databases containing CNV
information of control subjects from certain ethnic groups, such as
Caucasian, African-American, and Asian American [5,6]. These
general population databases greatly help with clinical
interpretation of CNVs, which can be divided into three main categories:
pathogenic, uncertain clinical significance, or benign [7].
Recently, publications focusing on CNVs of specific ethnicities such as
Koreans [8], Europeans [9], and Chinese [10] have emphasized
that there is a significant number of population-specific CNVs. So
far the number of Thai individuals represented in the existing
databases for CNVs in the general population is very limited [11], and
thus they are by no means the ideal references for CNV
interpretation in Thais. The International Haplotype Map Project
phase III (HapMap3) has made publicly accessible SNP genotyping
and CNV data of more than a thousand subjects from 11
different ethnic groups, e.g. European, African, and East Asian
ancestries [12]. The HapMap3 dataset provides an opportunity to
compare genetic variations across populations. Hence, CNVs in a
larger sample of Thai individuals can be characterized and
distinguished from those of East Asian and other populations.

In this study, we combined the genomics data generated from
multiple genome-wide association studies (GWAS) consisting of
3,017 unrelated Thai subjects with no undiagnosed genetic
disorders. We carried out CNV discovery from these datasets
using two commonly used CNV calling algorithms, PennCNV
[13] and CNV Workshop [14], to identify the most accurate set of
CNVs, and put together the first large reference CNV database for
Thais. Furthermore, we performed a population Copy Number
Variation Region (CNVR) frequency comparison between Thais
and the 11 HapMap3 populations, and identified unique CNVRs in
Thais as well as CNVs overlapping with genes associated with the
Thai population. Genetic similarity between populations was
also explored using hierarchical clustering analysis (HCA) based
on the CNV frequencies. The Thai CNV database should
contribute to a more accurate clinical interpretation of CNVs in
Thai patients and serve as the starting point for future population
genetics and genetic epidemiology studies.

Materials and Methods 

Study populations 

The study population was compiled from previously published
genome-wide association studies (GWAS) in Thai individuals
[15,16,17,18,19], which were generated under collaborations
between the Ministry of Public Health, Thailand, the Thailand
Center of Excellence for Life Sciences (TCELS), and the RIKEN
Center for Genomic Medicine (CGM), Japan (Table 1), and from CNV
data of 11 different ethnic groups publicly available through the
HapMap3 project (Table S1 in File S1) [14]. This study was
approved by the Committee on Human Rights Related to Research
Involving Human Subjects, Faculty of Medicine Ramathibodi
Hospital, Mahidol University.

CNV discovery in Thai population 

Genotype data in the Thai population were generated using the
Illumina Human610-Quad, Illumina HapMap 550v3, or
Illumina HumanOmniExpress genotyping platform (Illumina,
San Diego, CA, USA) (Table 1). Signal intensity data from 3,427
Thai individuals were obtained. Individual samples with an SD of
log R ratio >0.3, with an SNP call rate of <98%, or with
self-reported/genotype-derived sex inconsistency were excluded,
leaving a total of 3,017 Thais for subsequent analyses. Intensity
data of SNPs that were not in Hardy-Weinberg equilibrium
(HWE) using a threshold of 10 ' were excluded prior to CNV
prediction as previously described [17].
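
The sample-level exclusion rules above can be sketched as a simple filter. This is a minimal illustration only, not the authors' pipeline: the `Sample` record, its field names, and the example values are hypothetical, and only the three thresholds (LRR SD >0.3, call rate <98%, sex inconsistency) come from the text.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical per-sample QC record (field names are assumptions)."""
    sample_id: str
    lrr_sd: float        # standard deviation of the log R ratio
    call_rate: float     # SNP call rate as a fraction between 0 and 1
    reported_sex: str    # self-reported sex, "M" or "F"
    genotype_sex: str    # sex inferred from genotype data


def passes_sample_qc(s, max_lrr_sd=0.3, min_call_rate=0.98):
    """Return True if the sample survives all three exclusion criteria."""
    if s.lrr_sd > max_lrr_sd:
        return False                      # noisy intensity data
    if s.call_rate < min_call_rate:
        return False                      # too many missing genotypes
    if s.reported_sex != s.genotype_sex:
        return False                      # sex inconsistency
    return True


# Made-up example records, one per failure mode plus one passing sample.
samples = [
    Sample("T001", 0.21, 0.995, "F", "F"),
    Sample("T002", 0.35, 0.991, "M", "M"),   # fails: LRR SD > 0.3
    Sample("T003", 0.18, 0.972, "F", "F"),   # fails: call rate < 98%
    Sample("T004", 0.25, 0.989, "M", "F"),   # fails: sex mismatch
]
kept = [s.sample_id for s in samples if passes_sample_qc(s)]
print(kept)  # ['T001']
```

Samples sitting exactly on a threshold (LRR SD of 0.3, call rate of 98%) are retained here, which matches the strict inequalities in the text.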



Two CNV prediction algorithms, Hidden Markov Model
(HMM) and Circular Binary Segmentation (CBS), were used to
call CNVs from signal intensity in the Thai population. CNV
discovery using an HMM-based algorithm was performed with
PennCNV software version 2011Jun16 [13]. Briefly, the intensity
data of A and B alleles from raw files were extracted, normalized,
and transformed into Log R Ratio (LRR) and B Allele Frequency
(BAF) using GenomeStudio software (Illumina, San Diego, CA,
USA). A population frequency of B allele (pfb) file for the Thai
population was estimated and used together with the HMM model
file provided by the PennCNV software. LRR and BAF at each probe
location were then used to predict one of the four possible states of
CNV: homozygous deletion, heterozygous deletion, normal copy
number, and at least one copy duplication.
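
To make the two signals concrete, the sketch below uses the commonly cited idealized definitions: LRR as log2 of observed over expected total intensity, and BAF clustering at multiples of 1/n for a locus with n copies. It is an assumption-laden illustration, not PennCNV's HMM or GenomeStudio's normalization; real array data are noisier and compressed relative to these ideal values, and both function names are hypothetical.

```python
import math
from fractions import Fraction


def log_r_ratio(r_observed, r_expected):
    """Idealized LRR: log2 of observed total intensity over the expected intensity."""
    return math.log2(r_observed / r_expected)


def expected_baf_clusters(copy_number):
    """Idealized BAF values for a locus with the given total copy number.

    With n copies, the B allele can occupy 0..n of them, so BAF clusters
    at 0/n, 1/n, ..., n/n. With zero copies there is no signal to cluster.
    """
    if copy_number == 0:
        return []
    return [Fraction(b, copy_number) for b in range(copy_number + 1)]


# Idealized expectations for the copy-number states named in the text.
for state, copies in [("homozygous deletion", 0), ("heterozygous deletion", 1),
                      ("normal copy number", 2), ("one copy duplication", 3)]:
    print(state, [str(f) for f in expected_baf_clusters(copies)])
```

Under this idealization a heterozygous deletion halves the intensity (LRR of -1) and loses the BAF cluster at 1/2, while a one-copy duplication raises the intensity and splits the heterozygous cluster into 1/3 and 2/3.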

A CBS-based algorithm was implemented in CNV Workshop
[14]. For CBS, LRR data were used to identify segments of the
genome that display a change in signal intensity. The mean LRR and
the distribution of BAF were then used to predict how likely each
segment of the genome is to be a copy number variant. CNVs were then
called using default parameters. The CNV statistics illustrating the
characteristics of HMM and CBS are summarized in Table S2
in File S1.

CNVs in HapMap3 populations were downloaded from the
HapMap project website,
http://HapMap.ncbi.nlm.nih.gov/downloads/cnv_data/hm3_cnv_submission.txt,
on March 12, 2014. Family information and population origin of the samples
were obtained from Coriell Cell Repositories (http://ccr.coriell.org/)
using an in-house Python script. The same quality control
criteria used to filter CNVs in the Thai population were applied to
the HapMap3 data. After excluding offspring, there were 79,517
CNVs in 1,038 individuals left for subsequent analysis.

Quality control of CNV data 

The CNVs predicted by the HMM algorithm were verified with the
results from the CBS algorithm. For each subject, CNVs called by
both the HMM and CBS algorithms with at least 60% overlapping
length were considered replicable. These overlapping regions were
used as the start and end of CNVs in subsequent analyses. To
minimize false positive results, we only included CNVs with a
density of at least 1 SNP per 30 kb, at least 5 SNPs (>5 kb) for
deletion CNVs, and at least 10 SNPs (>10 kb) for duplication CNVs.
CNVs overlapping by more than 50% with centromeric and telomeric
regions, and CNVs on sex chromosomes, were excluded (Figure 1a).
Individuals predicted to have more than 100 CNVs, most likely
resulting from a genotyping array error, were also excluded [14].
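
The concordance and filtering rules above can be sketched as follows. The text does not state which call the 60% overlap is measured against, so this sketch assumes the longer of the two calls; the function names and the tuple layout are hypothetical, and coordinates are treated as simple (start, end) base-pair positions on the same chromosome.

```python
def replicated_region(hmm_call, cbs_call, min_fraction=0.6):
    """Return the shared (start, end) of two same-chromosome calls, or None.

    Assumption: the overlap fraction is measured against the longer call.
    """
    start = max(hmm_call[0], cbs_call[0])
    end = min(hmm_call[1], cbs_call[1])
    if start >= end:
        return None                                   # no overlap at all
    longer = max(hmm_call[1] - hmm_call[0], cbs_call[1] - cbs_call[0])
    if (end - start) / longer < min_fraction:
        return None                                   # overlap too small
    return (start, end)                               # overlap becomes the CNV


def passes_cnv_filters(start, end, n_snps, cnv_type):
    """Apply the size, SNP-count, and probe-density filters to one CNV."""
    length = end - start
    if cnv_type == "deletion":
        min_snps, min_length = 5, 5_000
    else:                                             # treated as duplication
        min_snps, min_length = 10, 10_000
    dense_enough = n_snps * 30_000 >= length          # at least 1 SNP per 30 kb
    return n_snps >= min_snps and length > min_length and dense_enough


# Made-up coordinates: 80 kb shared out of a 100 kb call, so the pair is kept.
print(replicated_region((100_000, 200_000), (120_000, 210_000)))
# Only 20 kb shared out of a 220 kb call, so the pair is rejected.
print(replicated_region((100_000, 200_000), (180_000, 400_000)))
```

Measuring against the longer call is the stricter choice; measuring against the shorter call, or requiring reciprocal overlap, would be equally plausible readings of the criterion.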

CNV distribution among population 

Pairwise comparison of the frequencies of CNVs between Thais
and each of the 11 HapMap3 populations was performed using
the test of association function implemented in PLINK
(http://pngu.mgh.harvard.edu/~purcell/plink/cnv.shtml) [20]. The
empirical statistical significance level was calculated using 5,000
permutations. CNVs with statistically significantly different
frequency were defined as any CNV with empirical p-value <0.0002
(1 in 5000 chance). Only CNV loci encompassing genes were
chosen, and their frequencies in each population were
calculated. To identify CNVs with the greatest frequency
difference between the Thai and each HapMap3 population, 20
genes (p<0.0002), comprising the top 10 genes with higher
frequencies and the bottom 10 genes with lower frequencies in the
Thai population, were selected for each of the pairs. These CNV
frequencies across all populations were subsequently used to
perform hierarchical clustering analysis. CNV frequencies were
scaled and centered to have a mean of 0 and a variance of 1.
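
The idea behind an empirical p-value from 5,000 permutations can be sketched with a generic label-shuffling test on carrier frequencies. This is not PLINK's implementation; the function name, the two-sided test statistic, the fixed seed, and the carrier counts are all illustrative assumptions.

```python
import random


def permutation_p_value(carriers_a, n_a, carriers_b, n_b, n_perm=5000, seed=0):
    """Two-sided permutation test on the difference in CNV carrier frequency."""
    rng = random.Random(seed)
    total_carriers = carriers_a + carriers_b
    # 1 marks a carrier, 0 a non-carrier, pooled over both populations.
    labels = [1] * total_carriers + [0] * (n_a + n_b - total_carriers)
    observed = abs(carriers_a / n_a - carriers_b / n_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)                    # reassign individuals at random
        perm_a = sum(labels[:n_a])             # carriers landing in group A
        perm_b = total_carriers - perm_a
        if abs(perm_a / n_a - perm_b / n_b) >= observed:
            hits += 1
    # One common convention: count the observed data as one permutation.
    return (hits + 1) / (n_perm + 1)


# Made-up counts: a large frequency gap versus identical frequencies.
p_diff = permutation_p_value(150, 300, 10, 100)
p_same = permutation_p_value(30, 300, 10, 100)
print(p_diff, p_same)
```

Under the (hits + 1) / (n_perm + 1) convention, the smallest attainable value with 5,000 permutations is 1/5001, which falls just below the 0.0002 cut-off used in the text; a CNV reaches that threshold only when no permutation matches or exceeds the observed difference.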



PLOS ONE | www.plosone.org | August 2014 | Volume 9 | Issue 8 | e104355



Table 1. GWAS studies containing the genomics data of 3,017 Thai individuals after exclusion of low quality samples.

Reference                                  Type of SNP array             Number of subjects  Total  Excluded (%)
Jongjaroenprasert et al, 2012              Illumina Human610-quad        289                 330    12.424
Mahasirimongkol et al, 2012                Illumina Human610-quad        463                 484    4.339
Wattanapokayakit et al (unpublished data)  Illumina HumanOmniExpress-12  517                 685    24.526
Chantarangsu et al, 2011                   Illumina HumanHap550-Duo v3   56                  165    66.061
Chantarangsu et al, 2011                   Illumina Human610-quad        167                 210    20.476
Mahasirimongkol et al, 2012                Illumina Human610-quad        856                 868    1.382
Nuinoon et al, 2010                        Illumina Human610-quad        669                 685    2.336
Total                                                                    3,017               3,427  11.964

doi:10.1371/journal.pone.0104355.t001



Hierarchical clustering using Euclidean distance with the Ward
clustering method was performed on the scaled frequencies using the
pheatmap package in R version 3.0.1 [21].
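
The scaling and distance steps that feed the clustering can be sketched in a few lines. This sketch covers only z-scoring and the Euclidean distance matrix; the Ward linkage step performed by pheatmap is not reimplemented. The frequency values are hypothetical, and scaling each gene across populations with the sample standard deviation is an assumption about how the scaling was applied.

```python
import math
import statistics


def scale_row(values):
    """Center to mean 0 and scale to unit variance (sample SD, an assumption)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def euclidean(u, v):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical CNV frequencies (%): rows are genes, columns are populations.
populations = ["Thai", "JPT", "CEU", "YRI"]
freqs = [
    [52.0, 48.0, 20.0, 8.0],
    [3.0, 4.0, 10.0, 15.0],
    [1.0, 1.5, 0.5, 6.0],
]
scaled = [scale_row(row) for row in freqs]

# One scaled profile per population, then all pairwise distances.
profiles = {p: [row[i] for row in scaled] for i, p in enumerate(populations)}
dist = {(p, q): euclidean(profiles[p], profiles[q])
        for p in populations for q in populations}
print(round(dist[("Thai", "JPT")], 3), round(dist[("Thai", "YRI")], 3))
```

An agglomerative method such as Ward's would then repeatedly merge the pair of clusters whose union least increases within-cluster variance, starting from this distance matrix.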

[Figure 1. Panel a, workflow diagram: 3,427 individuals from 7 Thai population-based GWAS studies; sample exclusion
criteria (SNP call rate <98%, LRR SD >0.3); 3,017 Thai individuals; PennCNV and CNV Workshop; CNV exclusion criteria
(SNP density <1 SNP per 30 kb, deletion spanning <5 consecutive SNPs, duplication spanning <10 consecutive SNPs, CNVs on
sex chromosomes, CNVs with >50% overlap with centromeres and telomeres); 23,458 autosomal CNVs called by both PennCNV
and CNV Workshop (>60% overlapping length); construction of the Thai CNV database; CNV comparison between Thais and
the 1,038 individuals from the HapMap3 populations. Panel b, bar chart by CNV size (bp): <10K, 10K-100K, >100K;
series: CNV Workshop, PennCNV+CNV Workshop.]

Figure 1. CNV discovery in the Thai population. a) Diagram showing Thai CNV discovery workflow; b) % overlap proportion of CNVs identified
by both CNV Workshop and PennCNV based on CNV size (bp). The regions shaded in red correspond to CNVs exclusively discovered by CNV
Workshop, while regions shaded in blue represent those jointly discovered by CNV Workshop and PennCNV.
doi:10.1371/journal.pone.0104355.g001

Table 2. Thai CNV and CNVR characteristics.

                                   Thai CNVs                 Thai CNVRs
Total count                        23,458                    1,014
Duplication CNVs                   4,879                     165
Deletion CNVs                      18,579                    538
Complex CNVs                       -                         311
Median (mean) number per genome    8 (7.77)                  7 (7.35)
Median size (range) (kb)           59.80 (5.0-4275.08)       95.06 (5.18-4275.08)
Median size of duplications        122.76 (100.45-4275.08)   137.34 (14.67-1491.4)
Median size of deletions           40.81 (5.0-3893.87)       37.5 (5.18-2144.0)
Genome coverage                    -                         261.77 Mb (8.72%)

doi:10.1371/journal.pone.0104355.t002

Copy number variable region

In this study, we applied the widely used term Copy Number
Variable Region (CNVR) to represent a discrete region in the
genome that overlaps with CNVs. After combining CNV data
from the Thai and HapMap3 populations, CNVRs were defined
by merging overlapping CNVs into a discrete region using the
GenomicRanges package in R. The frequencies of CNVRs in
each population were calculated by counting the number of
individuals whose CNV(s) fell within each predefined CNVR,
divided by the total number of people in each population. CNVRs
with at least 5% frequency in Thais were defined here as common
CNVRs. CNVRs overlapping with gene regions were identified
using the GenomicRanges package in R and the gene list based on
hg18 data downloaded from the PLINK software resource page
(http://pngu.mgh.harvard.edu/~purcell/plink/dist/glist-hg18).
To identify the degree of match between the Thai and HapMap3
CNVRs, CNVRs were created separately using either CNVs
from Thais or the HapMap3 populations. Only CNVs available
in more than one individual were included in this analysis.
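
The CNVR construction and frequency calculation described in this section can be sketched in plain Python. This is an illustrative interval merge, not the GenomicRanges implementation used in the study; the function names, the tuple layouts, and the example calls are hypothetical.

```python
def merge_cnvs(cnvs):
    """Merge overlapping (chrom, start, end) calls into discrete CNVRs."""
    regions = []
    for chrom, start, end in sorted(cnvs):
        last = regions[-1] if regions else None
        if last and last[0] == chrom and start <= last[2]:
            last[2] = max(last[2], end)        # extend the current region
        else:
            regions.append([chrom, start, end])
    return [tuple(r) for r in regions]


def cnvr_frequency(cnvr, calls, n_individuals):
    """Fraction of individuals with at least one CNV inside the CNVR.

    calls: list of (individual_id, chrom, start, end).
    """
    chrom, start, end = cnvr
    carriers = {ind for ind, c, s, e in calls
                if c == chrom and start <= s and e <= end}
    return len(carriers) / n_individuals


# Made-up calls from a population of four individuals (D carries no CNV).
calls = [
    ("A", "chr1", 100, 200),
    ("B", "chr1", 150, 300),
    ("C", "chr1", 500, 600),
    ("A", "chr2", 100, 200),
]
cnvrs = merge_cnvs([(c, s, e) for _, c, s, e in calls])
print(cnvrs)
print(cnvr_frequency(cnvrs[0], calls, n_individuals=4))
```

Counting distinct individuals rather than calls keeps the frequency at or below 1 even when one person carries several CNVs inside the same region.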

Results 

Characteristics of CNV in the Thai population 

After excluding the samples with poor quality data, including
low SNP call rates and high LRR SDs, there were 3,017
individuals left for subsequent analyses (Table 1 and Figure 1a).
Among the 42,290 CNVs identified by CNV Workshop, 29,436
CNVs (70%) were consistently predicted by PennCNV
(Figure 1b). We extracted the most confident CNV dataset possible by
excluding CNVs represented by sparse probe coverage (<1 SNP
per 30 kb), small deletions (<5 consecutive SNPs), or small
duplications (<10 consecutive SNPs). CNVs overlapping with
centromeric and telomeric regions as well as sex chromosomes
were also excluded due to high false positive CNV prediction
rates. After these filtering steps, 23,458 CNVs
passed the criteria and were kept for subsequent analyses. The
CNV size ranged from 5 kb to 4.28 Mb, with a median of 59,804
bp (Table 2). Up to 23 CNVs were identified in each individual,
with a median of eight CNVs per genome. Overall, we
observed more deletion CNVs (79%) than duplication CNVs
(21%). The median size of CNVs was 40,811 bp for deletions and
122,757 bp for duplications. The higher number of deletion
CNVs as compared to duplication CNVs may reflect the power of
current CNV calling algorithms to preferentially detect smaller
deletions. The largest CNV identified was a duplication of 4.28
Mb spanning chromosome 12q14 to 12q15. Although this CNV
overlapped with a known 12q14 microdeletion syndrome
documented in the DECIPHER database v7.0
(https://decipher.sanger.ac.uk/syndromes) [4], there was no reported clinical
significance of duplication of the same region. Among the
23,458 CNVs identified in the Thai population, 766 CNVs (3.27%)
overlapped with known chromosomal imbalance syndromes
curated in the DECIPHER database with reference to matched
CNV type.

[Figure 2. Axis labels: Frequency; CNV size (kb); Degree of match (%), by CNVR frequency bin (<1, 1-5, 5-10, 10-20,
20-50, >50). Legend: Thai, ASW, GIH, CHB, CHD, JPT, LWK, MKK, MEX, TSI, CEU, YRI.]

Figure 2. CNV and CNVR comparison between the Thai and eleven HapMap3 populations. a) Size distribution of the Thai CNVs and
HapMap3 CNVs; b) Allele frequency spectrum of CNVs with frequency of at least 1% across the Thai and HapMap3 CNVRs; c) Degree of match
between the Thai CNVRs and HapMap3 CNVRs with reference to allele frequency.
doi:10.1371/journal.pone.0104355.g002

Thai CNVs versus HapMap3 

The overall frequency and size distribution of CNVs in the Thai
population was relatively similar to the distributions in the
11 HapMap3 populations (Figure 2a and 2b). Comparing
the CNV sizes, we observed that HapMap3 had a higher number of
small CNVs (<50 kb) than the Thai population
(Figure 2a). When CNVs were combined into discrete regions
within the HapMap3 populations, after excluding CNVs found
only in a single individual, there were 506 discrete CNVRs.
Considering only CNVRs created from the Thai CNVs, 822
(81.14%) out of 1,014 CNVRs did not overlap with any CNVRs in
HapMap3, while there were 192 CNVRs that were common to
both Thais and HapMap3. The median size of these shared
CNVRs was 174.1 kb, with a mean allele frequency of 2.65%. The
median size of CNVRs found only in the Thai population was
83.8 kb, with a mean allele frequency of 0.26%. The CNVRs
shared between the Thai and HapMap3 populations were both
statistically significantly larger and more common than the
Thai-specific CNVRs (p-value <0.001). All common CNVRs with
frequency above 20% could be found in both Thais and
HapMap3. As CNVRs became less frequent, the proportion of
CNVRs present in both populations became lower (Figure 2c).

After combining CNV data in the Thai population with
those from HapMap3, 2,560 CNVRs were defined. Most
CNVRs (60%) were found in only one individual. Common
CNVRs with at least 5% frequency in the Thai population are
summarized and contrasted with the HapMap3 populations in
Table 3. The most common CNVR (hg18 location chr4:
69,045,672-69,258,302) in Thais was found on chromosome
4q13.2, overlapping with UGT2B15 and UGT2B17 (encoding
uridine diphospho-glucuronosyltransferases), in 1,564 individuals
(52%). The CNVs overlapping UGT2B17 in most Thai people
were found to be homozygous deletions (92.3%), similar to
Japanese from Tokyo (JPT: 78.8%), Chinese from Beijing (CHB:
75.0%), and Chinese from Denver (CHD: 73.8%). The proportion
of people carrying a homozygous deletion of UGT2B17 was
lowest in African populations, with a frequency of 9.8% in Yoruba
in Ibadan, Nigeria (YRI) and 9.5% in the population with African
ancestry in Southwest USA (ASW) (Table 4).

In an attempt to identify additional CNVs overlapping with
genes that might be either more common or less common in the
Thai population compared to HapMap3, the frequencies of CNVs
in the Thai population versus each of the 11 HapMap3
populations were compared using a test of association. When a
cut-off p-value of <0.01 was used, a total of 173 genes overlapping
with CNVs were identified (Table S3 in File S1). To uncover the
candidate genes representing Thai-specific CNVs, the top 20
genes showing the greatest difference in frequency between each
population pair were chosen, which resulted in a non-redundant
list of 35 genes (Table S4 in File S1; p<0.0002). Hierarchical
clustering analysis (HCA) was performed on the scaled frequency
data of these gene-overlapping CNVs to group the most similar
populations together based on frequencies. Although these CNVs
were picked to highlight the difference between the Thai and
HapMap3 populations, the populations that showed the closest
relationship to Thais were JPT, CHB, and CHD (Figure 3). Based
on the HCA results, Asian populations were still the most similar
to each other. Populations with European and African ancestries
from HapMap3 samples were placed in a different clade from
Asian populations. Hence, caution should be taken when
interpreting CNVs found in the Thai population using CNV
databases created based on subjects with European or African
ancestry.

Figure 3. Hierarchical clustering analysis (HCA) of the 35 genes overlapping CNVs with statistically significantly different allele
frequencies across HapMap3 populations as compared with Thais (permutation P-value <0.0002). The color bar on the right shows the
color codes assigned to each frequency range in percent.
doi:10.1371/journal.pone.0104355.g003

The Thai CNV database 

As a significant fraction of CNVs were population specific, a
web-based database containing CNVs identified in the Thai
population was created to facilitate the clinical use of CNVs in
genetic diagnosis. The website is freely available and can be
accessed at http://thaicnv.icbs.mahidol.ac.th/thaicnv (Figure 4a).
The MySQL database schema is shown in Figure S1. The website
allows users to query specific CNVs using a genomic
coordinate based on either UCSC genome build hg18 or hg19. CNVs
identified in the Thai samples are listed in a table format,
and a graphical interface of these CNVs is provided (Figure 4b).
Links to a list of RefSeq genes overlapping with each CNV, and to
frequently used genome browsers, namely the UCSC genome browser,
Ensembl, DGV, DECIPHER, and NCBI dbVar, can be followed
directly from the list of CNVs shown in the table. Users can
also restrict the table to a single CNV type, deletion or
duplication. Furthermore, users can choose to see CNVs
that were called by each CNV-calling algorithm or a combination
of algorithms. Graphical interfaces of reference CNVs from
HapMap3 and CHOP CNV (http://cnv.chop.edu/) are also
provided for convenience. An example of unique CNVs in the Thai
population is shown in Figure 4b-II.

Figure 4. Thai CNV database. a) A screen-captured image of the Thai CNV homepage (http://thaicnv.icbs.mahidol.ac.th/thaicnv/); b) An example of
the CNV search page. Red and blue lines indicate deletion and duplication CNVs, respectively. Arrowheads indicate the starting and ending genomic
locations. Panel I - input panel; panel II - graphical view; panel III - table view.
doi:10.1371/journal.pone.0104355.g004

Discussion 

We have established a large reference CNV database for the Thai
population, which contains CNVs from 3,017 unrelated Thai
individuals whose high-resolution Illumina SNP array-derived
GWAS data were previously published. These subjects consisted of
patients with infectious diseases, namely tuberculosis, leprosy, and
HIV/AIDS, patients with Thyrotoxic Hypokalemic Periodic
Paralysis (THPP), and patients with Hb E/β-thalassemia. None
of the subjects had other documented genetic disorders on top of
their conditions at the time of diagnosis. Although Hb E/β-thalassemia
is a genetic disease, it is a known single gene disorder
with autosomal recessive inheritance resulting from compound
heterozygous mutations in the HBB gene. Therefore, it may be
assumed that CNVs in these Thai individuals are mostly benign,
although the possibility that some of these CNVs might be
associated with disease susceptibility cannot be completely ruled
out. We also confirmed that only 3.27% of the Thai CNVs
overlapped with the known chromosomal disorders in DECIPHER
v7.0. Our database is the most representative of the general
Thai population to date, and is therefore suitable as a control for
clinical interpretation of CNVs in Thai patients and related ethnic
groups with potential genetic disorders.

By using rigorous filtering criteria as well as a combination of
two different algorithms for CNV calling to avoid potential
algorithm-specific errors [22], we identified a median of eight
high-confidence CNVs per Thai individual. This number is considerably
lower than the medians of each HapMap3 population [12]. This
may be because only autosomal CNVs were included in our study,
whereas HapMap3 used two denser SNP arrays, combining
Illumina 1M with Affymetrix 6.0, and thus allowed a higher
number of smaller CNVs to be detected with confidence (Table S5
in File S1). The estimated cumulative genome coverage of Thai
CNVRs was 8.72%, which is similar to an earlier report using a
relatively homogeneous study population [8]. In accordance with the
HapMap3 study, the majority of Thai CNVs characterized were
at low allele frequency, and the allele frequency spectrum of CNVs
with >10% frequency was relatively similar between the Thai and
HapMap3 populations. However, the larger Thai population
sample size may contribute to the higher absolute number of low
frequency CNVs observed in the Thai individuals.

To identify Thai-specific CNVRs, we examined the degree of
match between the CNVRs characterized in the Thai and
HapMap3 subjects, and found that approximately 80% of Thai
CNVRs did not overlap with those of HapMap3. These CNVRs
tended to be significantly smaller, with a mean allele frequency of
only 0.26%. The large number of rare Thai-specific CNVRs
(<0.5% frequency) may be explained, at least in part, by the fact that
no Thai individual is included in the multiracial HapMap3
study population. On the contrary, common Thai CNVRs (>5%
frequency) showed a higher degree of match with HapMap3
CNVRs, reflecting that common CNVRs are shared regardless of
ethnicity. These findings are in agreement with a previous study
comparing Korean CNVRs with CNVRs
derived from a public CNV depository database, DGV [8].

Furthermore, we determined a set of gene-overlapping CNVs
whose frequencies were statistically significantly different
between Thais and each HapMap3 population. Uridine
diphospho-glucuronosyltransferase 2B17 (UGT2B17) was among the
top 35 genes, and it also overlapped with the most common
Thai CNVR. UGT2B17 is the most active enzyme in
glucuronidation of androgens, which are a major source for estrogen. Both
androgen and estrogen help stimulate bone formation in humans.
Higher UGT2B17 gene copy number (≥ one copy) is associated
with increased risk of osteoporotic hip fracture in Chinese and
Caucasian populations, while homozygous deletion of UGT2B17
is a protective factor [23]. Hip fracture rates after age adjustment
are higher in Scandinavia and North America than in
Southern Europe, Asia, and Latin America [24]. Interestingly, our
data correspondingly demonstrated a higher frequency of UGT2B17
homozygous deletion in East Asian populations (CHB, CHD, JPT)
than in Caucasian populations (CEU, TSI). The frequency
difference of UGT2B17 homozygous deletion across populations,
therefore, is consistent with the lower risk of osteoporotic hip
fracture found in Asian as compared to Caucasian populations.
The number of Thais with UGT2B17 homozygous deletion falls
between that of East Asians and Caucasians. However, a large
molecular epidemiological study is needed to clarify the incidence
and prevalence of osteoporotic fracture of the hip in the Thai
population and establish the correlation between UGT2B17 copy
number variation and osteoporosis risk.

It is known that there are subtle genetic differences within the
Asian populations that may render genetic information not
completely interchangeable [25]. A study has shown that although
the similarity in allele frequency and linkage disequilibrium
between Thais and East Asians is high, at least 5% of
drug-related alleles in Thais are not captured by East Asian-derived
haplotype-tagging SNPs [26]. In line with the above observation, a
hierarchical clustering analysis using allele frequencies of CNVs
containing the 35 top candidate genes across populations could
successfully separate the 12 study populations into three groups
according to their ancestral origin: Africans (LWK, YRI, ASW,
MKK), Europeans (GIH, MEX, TSI, CEU), and Asians (Thai, JPT,
CHB, CHD). Based on the CNV occurrences, Thais were clustered
near the East Asian populations, yet were clearly distinguishable
from them. Hence, caution should be taken when interpreting
CNVs of uncertain clinical significance found in the Thai population
using reference CNV databases created based on non-Thai
populations. Lack of reference CNVs available in
Caucasian-populated CNV databases at Thai-specific CNV locations might
lead to Thai CNVs of uncertain significance being misinterpreted
as pathogenic.

In summary, we have established a reference CNV database for
Thais, which is the largest of its kind to date. This database will
serve as a valuable resource of reference CNVs for clinical
diagnosis of Thai patients with genetic disorders, and for identifying
Thai-specific novel CNVs and CNVRs that are differentially
distributed among other populations. From this study, we have
characterized population-specific CNVs, supporting the notion that
a population-specific CNV database will greatly contribute to
more accurate interpretation of the clinical significance of CNVs.

Supporting Information 

Figure S1 MySQL schema for Thai CNV database.
(TIFF)

File S1 Supporting Tables S1-S5. Table S1: Sample size of
all populations used in this study. Table S2: Characteristics of
CNVs discovered by PennCNV (HMM) and CNV Workshop
(CBS). Table S3: List of 173 genes overlapping CNVs with
statistically significantly different allele frequencies across
HapMap3 populations as compared with Thais (p-value <0.01). Table
S4: The 35 genes overlapping CNVs with statistically significantly
different allele frequencies across HapMap3 populations as
compared with Thais (p-value <0.0002); the pink, blue, and
yellow cells represent homozygous or heterozygous deletion